Introduction


Instruction Set Strategies 


“TO MICROCODE OR NOT TO MICROCODE” merely begins the list of de- 
sign questions that engineers ponder when they create new microprocessors. 
From there, engineers delve into the complexities of instruction pipelining, the 
frustrating von Neumann bottleneck, and the intricacies of efficient compiler 
construction. These myriad choices focus on the instruction set strategy envi- 
sioned for a new chip. 

To date, two prominent approaches to designing CPUs and their instruction 
sets have emerged—reduced instruction set computer (RISC) and complex in- 
struction set computer (CISC) designs. The primary difference between these 
strategies is whether to use a small number of fast-executing instructions or to 
use more complicated instructions that make writing programs easier. But there 
are other strategies as well, including one that author Phil Koopman Jr. calls a 
writable instruction set computer (WISC). The articles on the following pages 
throw some light into the dark corners of microprocessor design to help you 
understand the role of the instruction set. 

As we put together this group of articles, a common thread began to emerge— 
RISC and CISC designs are beginning to converge. Many of the best design 
ideas from each approach are being combined to create hybrid CPUs that have 
both RISC and CISC components. Both the Fairchild Clipper CPU chip set and 
the Motorola 68030, for example, exhibit this convergence of methodologies. 

To set the stage for this exploration of instruction set strategies, Phillip Robin- 
son surveys the RISC landscape to establish where the tenets of this design ap- 
proach have brought us to date. Next, Thomas L. Johnson of Motorola explains 
how the new MC68030 microprocessor incorporates many RISC-like features 
in what at first seems to be a classic CISC machine. Mike Ackerman and Gary
Baum’s article on the Fairchild Clipper goes a step further in demonstrating the 
convergence of RISC and CISC in that processor. 

RISC designs with their abbreviated instruction sets place a substantial bur- 
den on compilers. The new Novix NC4016 CPU uses FORTH as its instruction 
set, and Dan Miller shows how this stack-oriented RISC machine and FORTH 
contribute to the creation of an efficient C compiler. 

To close out our tour of instruction set strategies, Phil Koopman discusses the 
advantages of a writable instruction store on a stack-based microprocessor. 

For additional information on instruction set design, see the BIX conference 
apr87.sup, topic acorn.risc, for an article on the Acorn RISC processor. James 
J. Farrell II and John F. Stockton explore the design characteristics of this chip.

All this RISC-versus-CISC debate may ultimately prove only one thing: that 
one person’s acronym is another person’s anachronism. 

—G. Michael Vose, Senior Technical Editor 
INSTRUCTION SET STRATEGIES





How Much of a RISC? 


The past, present, and future of reduced instruction set computers


JUST A FEW YEARS AGO, the idea of 
a new computer architecture based on 
simplified, streamlined central proces- 
sors was mainly an academic curiosity. 
Invented at IBM and shaped at Berkeley 
and Stanford, the RISC principle embodied
a heresy: that most commercial
microprocessor architecture had bloated
far beyond the optimum level.

Later, the heresy became a debate. In 
fact, a panel discussion entitled “The 
Great RISC versus CISC Debate” at the 
1986 COMPCON show in San Francisco 
turned into the conference’s centerpiece. 
Laboratory experiments have proved that 
lean chips with reduced instruction sets 
can run benchmark tests at fantastic 
speeds, but some system designers re- 
main unconvinced that RISC will be as 
useful in the real world of complex sys- 
tems and applications. 

To learn the state of RISC, I talked to 
some of its original proponents and sur- 
veyed the state of commercial RISC ma- 
chines. My conclusion is that RISC prin- 
ciples have greatly influenced computer 
design, even when they aren’t adhered to 
directly. Furthermore, enough RISC ma- 
chines, ranging from the microcomputer 
level to the superminicomputer level, are 
now on the market that the commercial 
potential of RISC will soon be evident, 
instead of just the subject of conference
debate.


But don’t expect RISC to change the 
world right away. As one of the early 
RISC researchers, Professor John Hennessy
of Stanford, puts it, “Market directions
always move more slowly than [the
introduction of] technical products.”





Phillip Robinson 


Several marketing departments have leapt 
on the RISC bandwagon with statements 
that claim a processor takes advantage of 
both RISC and complex instruction set 
computer ideas. And if you read the comments
in the BIX conferences “cpus/risc” and
“rwars/computers,” you’ll find
a lot of argument over what is RISC and 
what isn’t. For example, some people 
refer to the Novix chip (see the article
entitled “Stack Machines and Compiler
Design,” by Daniel L. Miller on page
177) as RISC, and others do not. In the 
end, most people agree that RISC isn’t 
just a smaller instruction set. It is a
simplified microprocessor that jettisons all
baggage that slows the raw processing
speed.


Tracing Its Roots 

Both RISC and its predecessor, CISC, 
are commonly credited to IBM. The first 
CISC machine was probably the IBM 360 
mainframe, which was created in 1964. 
The 360 made extensive use of micropro- 
gramming, building instructions out of 
series of microinstructions that were in 
turn stored in ROM within the CPU. De- 
coding an instruction into a sequence of 
microinstructions requires several look- 
up operations and, therefore, multiple 
clock cycles. 

Engineers understood the additional 
clock cycles to be a natural consequence 
of putting more hardware functions into 
software. They tried to beat the rapidly 
growing expense of software by imple- 
menting more and more software func- 
tions in hardware. 

RISC began in 1975 at IBM. John
Cocke, an IBM Fellow, was working
with a team to make a very large tele- 
phone switching system. Such a large sys- 
tem needed a fast controller. The team 
experimented along many lines, includ- 
ing slashing the instruction set. Later, 
after abandoning the switching system 
project, the team considered using the 
controller itself as a computer. The out- 
growth of their efforts was the 801 mini- 
computer built in 1979. (The team named 
it after the number of the IBM building in
which it was made.) 

According to Cocke, “reducing the 
number of instructions was more a result 
than a cause.” His team had trace statistics
listings of how often each instruction was 
used that convinced it not to add compli- 
cated instructions to a machine when 
those same instructions could be built up 
from simpler ones without hurting per- 
formance. 

The team designed the 801 with very
fast memory and fixed format instruc- 
tions that could execute in a single clock 
cycle. That allowed a lot of pipelining and 
overlapping of instruction execution. 

Although the 801 never became a commercial
product, the IBM RT PC workstation
announced early in 1986 took up
the RISC baton: It is a direct offshoot of
the 801. Another IBM Fellow, G. Glenn
Henry, who had previously worked on
the System/38, was a guiding force behind
the RT. The RT’s foundation is a
microprocessor called the research/office
products division microprocessor (ROMP)
and a companion memory management
chip. The ROMP works with standard
memory, 150-nanosecond 256K-byte
DRAMs. It has a very fast memory bus
and can transfer one word of data and one
address every machine cycle (170 ns).
The bus has 32 lines that function as address
lines during half of the cycle and as
data during the other half. Figure 1 shows
the RT’s processor-board block diagram.

Phillip Robinson is a contributing editor
for BYTE and editor of the Desktop Engineering
newsletter at P.O. Box 40180,
Berkeley, CA 94704.

The ROMP has 118 instructions, fewer
than half the number of the DEC VAX-11/780,
the computer most often used
today as a standard for speed and an example
of CISC architecture. That, however,
is nearly three times as many instructions
as are found in other RISC designs.
That midpoint status, along with the fact 
that instead of a single instruction format 
the ROMP has seven different formats, 
leads some RISC proponents to say that 
the RT isn’t a true RISC machine. Still, it 
has many of the elements of a RISC ma- 
chine, and until other RISC micros ap- 
pear, the RT’s commercial success or 
failure might represent the success or 
failure of the RISC concept to the buying 
public. 


RISC Defined 

RISC is more than just a small instruction 
set. David Patterson, a professor at the 
University of California at Berkeley 
whose group first coined the term, says 
that the definition of RISC is a matter of 
constant debate in the computer architecture
community. However, there are a
few points that are commonly accepted.

First, a RISC machine must execute 
one instruction each clock cycle. Traces 
of computer programs consistently show 
that the most heavily used instructions 
are the primitives. With proper design, 
engineers can write these to run in a sin- 
gle clock cycle. That simplifies pipelin- 
ing, interrupts, and a host of other micro- 
processor design attributes. Sticking to 
primitives, however, requires compilers 
to use more software subroutines for 
complex procedures. 

A major argument against RISC has 
been that the processors will need to use 
so many more of the simple instructions 
in the place of powerful, complex instruc- 
tions that the increase in path length 
(number of instructions to get the job 
done) will negate the advantage of run- 
ning each instruction faster. According to 
Professor Hennessy of the Stanford MIPS 
(microprocessor without interlocked pipe 
stages) project, RISC machines pay
around a 30 percent penalty in added in- 
structions over microcoded machines. 
However, he says, “We are willing to take
a 30 percent hit in return for a fivefold im- 
provement in cycles per instruction.” 
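
Hennessy's trade-off can be checked with quick arithmetic. The sketch below models execution time as instruction count times cycles per instruction and assumes the same clock period on both machines, which the article does not state. The function name is illustrative only.

```python
def risc_speedup(path_length_penalty: float, cpi_improvement: float) -> float:
    """Net speedup of a RISC machine over a microcoded one.

    Execution time is modeled as instruction_count * cycles_per_instruction,
    with the clock period assumed equal on both machines (an assumption).
    """
    # The RISC program executes (1 + penalty) times as many instructions...
    instruction_ratio = 1.0 + path_length_penalty
    # ...but each instruction needs 1/cpi_improvement as many cycles.
    cycle_ratio = 1.0 / cpi_improvement
    return 1.0 / (instruction_ratio * cycle_ratio)


# Hennessy's figures: 30 percent more instructions, fivefold better CPI.
speedup = risc_speedup(0.30, 5.0)
print(f"net speedup: {speedup:.2f}x")
```

Under these assumptions the longer path length costs far less than the faster instruction rate gains, which is the point of the quotation.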

Second, a RISC machine must use a 
fixed format for the instructions. Doing
so makes decoding simple. Assigning 
each field to a particular function allows 
hardwiring of the instructions, and avoid- 
ing microcode adds more speed. Only 6 
percent to 10 percent of the chip area of 
the Berkeley RISC I and II chips was de- 
voted to control functions, while 50 per- 
cent to 60 percent of the total chip area in 
a 68000 or Z8000 is the control section. 

Figure 1: The IBM RT PC processor-board block diagram.

Third, RISC machines stick to a load/
store architecture. That means the only 
instructions that deal with memory are 
simple load or store instructions. All 
other manipulations take place inside the 
microprocessor registers. This arrange- 
ment simplifies addressing and makes it 
easier to restart instructions for exception 
conditions. It also requires a large num- 
ber of on-chip registers, a common fea- 
ture of RISC chips and one that some de- 
tractors claim is the main reason for 
improved performance. 
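
A toy interpreter makes the load/store rule concrete. This is a minimal sketch, not any real instruction set: the three operation names and the tuple encoding are invented for illustration. Only LOAD and STORE touch memory, and ADD works on registers alone.

```python
def run(program, memory, num_registers=8):
    """Execute a toy load/store program and return the final memory."""
    regs = [0] * num_registers
    for op, *args in program:
        if op == "LOAD":       # rd <- memory[addr]
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "STORE":    # memory[addr] <- rs
            rs, addr = args
            memory[addr] = regs[rs]
        elif op == "ADD":      # rd <- rs1 + rs2, registers only
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        else:
            raise ValueError(f"unknown operation {op!r}")
    return memory


# memory[2] = memory[0] + memory[1]. A memory-to-memory add could say this
# in one instruction; a load/store machine needs four simple ones.
program = [
    ("LOAD", 0, 0),
    ("LOAD", 1, 1),
    ("ADD", 2, 0, 1),
    ("STORE", 2, 2),
]
memory = run(program, [7, 35, 0])
print(memory)
```

The extra instructions are the path-length cost discussed earlier. In exchange, no arithmetic instruction can fault partway through a memory access, which is what makes restarting after an exception simpler.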

Finally, RISC machines require more 
compile-time effort than CISC machines. 
Because of RISC’s relatively few instruc- 
tions and addressing modes, more effort 
should go into compilers that can order 
the primitive instructions in the most effi- 
cient manner, tailoring the instruction se- 
quences to the exact requirements of the 
high-level language chosen. 


The Berkeley Camp 

A team including Dr. David Patterson de- 
veloped the RISC I and II chips (see 
“RISC Chips,” November 1984 BYTE). 
While these chips were based on the pre- 
vious IBM work, they also established 
some standards for RISC. After the suc- 
cessful design and fabrication of simple 
32-bit microprocessors that ran from 2 to 
10 million instructions per second peak, 
the Berkeley team tackled a project called 
SOAR (Smalltalk on a RISC). This proj- 
ect resulted in another RISC chip that was 
dedicated to running Smalltalk, and it 
proved, according to Patterson, “that you 
don’t really need anything beyond a RISC 
machine to run Smalltalk.”

The University of California at Berke- 
ley team turned further to symbolic pro- 
cessing with the SPUR (symbolic pro- 
cessing using RISCs) project. Its goal 
was to design a multiprocessor worksta- 
tion for conducting parallel-processing 
research. The research focused on using 
LISP. The SPUR project is complex, in- 
cluding workstation development and re- 
search efforts in integrated circuits, com- 
puter architecture, operating systems, 
and programming languages. The SPUR 
system is built around 6 to 12 high-per- 
formance, CMOS, homogeneous RISC 
processors. The team chose the number 
of processors to permit parallel-process- 
ing experiments within a package small 
enough to be a personal workstation. 

The SPUR processor supports Com- 
mon LISP and IEEE floating-point pro- 
cessing. It uses three custom 2-micron 
CMOS chips: a cache controller, CPU, 
and floating-point coprocessor unit. Fig- 
ure 2 shows the arrangement of these on a 
processor board. The CPU is based on 
the Berkeley RISC architecture and uses 
a simple, uniform pipeline with hard- 
wired instructions and a large register 
file. It tries to stick to an instruction per 
clock cycle. SPUR goes beyond RISC II 
in that it adds a 512-byte instruction 
buffer, a fourth execution pipeline stage, 
a coprocessor interface, and support for 
LISP tagged data. Figure 4 compares the 
SPUR and RISC II pipelines. 


AMD Building Blocks 

Another processor based on the Berkeley 
RISC design is one that Advanced Micro 
Devices cooked up from two families of 
its VLSI chips: the bipolar Am29300 and 
the CMOS Am39300. AMD can use the 
chips in these families to make 32-bit, 
fixed-word-length RISC chip sets. The 
29300 can support cycle times of 80 ns. 
The pipeline in the AMD RISC chips is a 
two-level instruction-fetch-and-execute 
scheme. Both chips have a 4-gigabyte ad- 
dressing capability for virtual memory 
structures. 

The Am29334 four-port, dual-access 
register file; the Am29332 ALU (which 
includes a barrel shifter and a 64-bit-in, 
32-bit-out funnel shifter); and the 
Am29337 bounds checker are the basis of 
the AMD design. Several 29334 register 
file chips can be teamed to make larger 
register blocks. AMD has published a 
RISC processor design based on these 
components that closely resembles the 
Berkeley RISC I chip. Figure 3 shows a 
block diagram of that processor. 

AMD uses the 29334 register file chip 
in the AMD RISC design to duplicate the 
overlapping register windows of the 
Berkeley RISC design (see “RISC
Chips” by John Markoff, November
1984 BYTE). The overlapping improves
context-switching speed. The Berkeley 
research showed that one of the largest 
shares of CPU time was spent changing 
processor status for moves between dif- 
ferent program procedures or routines. 
With the register windows, the chip does
not have to copy all register values to
memory when going to a routine and then
read them back when returning from a
routine. Instead, each procedure is assigned
one register window, and the program
can change procedures just by
switching the active register window
within the register file in the CPU.


Figure 2: The SPUR processor-board block diagram.


Four 29334 chips can form the basis of 
a register block with seven register win- 
dows and 10 global registers. Each win- 
dow has 32 registers split into 10 global 
registers, 10 local registers, 6 incoming- 
parameter registers, and 6 outgoing-pa- 
rameter registers. The Berkeley RISC de- 
sign had 138 registers in eight windows. 
The global registers are available to all 
procedures. 
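
The register counts above can be reproduced with a small model of overlapping windows. The sketch assumes a circular file in which each window's outgoing-parameter registers are the next window's incoming-parameter registers. The logical numbering (0-9 global, 10-15 incoming, 16-25 local, 26-31 outgoing) is an assumption made for illustration, not taken from the AMD or Berkeley documentation.

```python
GLOBALS, LOCALS, PARAMS = 10, 10, 6
PER_WINDOW = LOCALS + PARAMS  # physical registers each window adds


def file_size(windows: int) -> int:
    """Physical registers needed for a circular set of overlapping windows."""
    return GLOBALS + windows * PER_WINDOW


def physical_index(window: int, reg: int, windows: int) -> int:
    """Map logical register 0-31 of a window to a physical slot.

    Assumed layout: 0-9 global, 10-15 incoming, 16-25 local, 26-31 outgoing.
    """
    if not 0 <= reg < GLOBALS + LOCALS + 2 * PARAMS:
        raise ValueError("logical register must be in 0..31")
    if reg < GLOBALS:
        return reg  # globals are shared by every window
    offset = reg - GLOBALS
    # Window w starts at slot w * 16. Its outgoing block (offsets 16-21)
    # lands on the next window's incoming block, wrapping at the end.
    slot = (window * PER_WINDOW + offset) % (windows * PER_WINDOW)
    return GLOBALS + slot


# Eight windows reproduce the 138 registers quoted for the Berkeley design.
print(file_size(8))
# The caller's first outgoing register is the callee's first incoming one.
print(physical_index(0, 26, 8) == physical_index(1, 10, 8))
```

Because a call only changes the window number, parameters are passed without copying: the caller writes them to its outgoing registers and the callee reads the same physical slots as its incoming registers.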

Each one of the 33 instructions is 32 
bits long with a fixed format. The op code 
occupies a 7-bit field. The design has 23 
more bits organized into three fields that 
specify the two source operands and the 
one destination. All instructions are de- 
coded by running the 7-bit op code 
through an on-chip programmable-logic- 
array section containing the control logic. 
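
With a fixed format, decoding reduces to a few shifts and masks. The sketch below assumes the field positions of the Berkeley RISC I short-immediate format: a 7-bit op code, a 1-bit condition-code flag, a 5-bit destination, a 5-bit first source, a 1-bit immediate flag, and a 13-bit second source. The text above gives only the 7-bit op code and the 23 operand bits, so the exact bit positions and the two flag bits are assumptions.

```python
from typing import NamedTuple


class Decoded(NamedTuple):
    opcode: int  # 7 bits, bits 31-25
    scc: int     # 1 bit, bit 24 (set condition codes; assumed)
    dest: int    # 5 bits, bits 23-19
    src1: int    # 5 bits, bits 18-14
    imm: int     # 1 bit, bit 13 (second source is an immediate; assumed)
    src2: int    # 13 bits, bits 12-0


def decode(word: int) -> Decoded:
    """Split a 32-bit instruction word into its fixed fields."""
    if not 0 <= word < 1 << 32:
        raise ValueError("instruction word must fit in 32 bits")
    return Decoded(
        opcode=(word >> 25) & 0x7F,
        scc=(word >> 24) & 0x1,
        dest=(word >> 19) & 0x1F,
        src1=(word >> 14) & 0x1F,
        imm=(word >> 13) & 0x1,
        src2=word & 0x1FFF,
    )


# A made-up word: op code 0x12, destination r3, source r4, immediate 100.
word = (0x12 << 25) | (3 << 19) | (4 << 14) | (1 << 13) | 100
print(decode(word))
```

Every field sits at the same position in every instruction, so hardware can extract the register numbers in parallel with decoding the op code. That is why a small programmable-logic array can replace a microcode store.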

While the chip executes one instruc- 
tion, it fetches the succeeding instruction 
from memory. To handle conditional 
branch instructions, AMD uses a delayed 
branch. The compiler used with the pro- 
cessor contains a code reorganizer that 
rearranges the sequence of instructions so 
that the one following the branch instruc- 
tion is always executed no matter what the 
branch condition. AMD claims that in 
nine out of ten cases the succeeding oper- 
ation can be useful. In the tenth case, it is 
a time-wasting NOP instruction. 
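
A much-simplified code reorganizer shows the idea. This is a sketch, not AMD's algorithm: it looks only at the instruction immediately before each branch, assumes straight-line code with no jump targets in between, and uses an invented tuple encoding of (operation, destination, sources).

```python
BRANCHES = {"BEQ", "BNE", "JMP"}   # illustrative op names
NOP = ("NOP", None, ())


def fill_delay_slots(code):
    """Place one instruction after every branch (the delay slot).

    If the instruction just before the branch does not produce a value the
    branch reads, it is moved into the slot; otherwise a NOP is inserted.
    """
    out = []
    last_is_slot = False  # True when out[-1] already occupies a delay slot
    for instr in code:
        op, _dest, sources = instr
        if op not in BRANCHES:
            out.append(instr)
            last_is_slot = False
            continue
        candidate = out[-1] if out and not last_is_slot else None
        if candidate is not None and candidate[1] not in sources:
            out.pop()                       # hoist the branch above it
            out.extend([instr, candidate])  # candidate now fills the slot
        else:
            out.extend([instr, NOP])        # nothing safe to move
        last_is_slot = True
    return out


code = [
    ("ADD", "r1", ("r2", "r3")),
    ("CMP", "cc", ("r4", "r5")),
    ("BEQ", None, ("cc",)),      # depends on CMP, so its slot gets a NOP
    ("SUB", "r6", ("r1", "r2")),
    ("JMP", None, ()),           # SUB can safely move into this slot
]
for instr in fill_delay_slots(code):
    print(instr[0])
```

A real reorganizer searches farther back (here it could have moved the ADD past the CMP into the first slot), which is how a nine-in-ten success rate becomes plausible.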


The Stanford Camp 

Professor John Hennessy of Stanford 
University was one of the other early aca- 
demic stalwarts of RISC. He helped put 
together the Stanford MIPS chip project. 
After MIPS succeeded in making a fast 
simple chip, the Stanford group turned to 
symbolic processing, much as the Berke- 
ley group did. But their paths diverged. 
While Patterson’s team customized its 
chip to symbolic-processing languages 
such as Smalltalk and LISP, the Stanford 
team, according to Hennessy, went for 
raw speed. 

This project, termed MIPS-XMP, pro- 
duced a 100 percent fully functional 32- 
bit microprocessor chip on the first fabri- 
cation round. The chip was designed to 
run at 20 MIPS peak, and the Stanford 
team currently has parts that work at 17 
MIPS peak. Of the 125,000 transistors 
on the chip, only 25,000 to 30,000 are 
nonmemory functions. The rest include a 
big on-chip cache and 32 general-pur- 
pose registers. 

After the original MIPS project, Hen- 
nessy temporarily left Stanford to help 
form MIPS Computer Systems of Sunny- 
vale, CA. He still works for the firm as 
chief scientist, while keeping his position 
at Stanford. MIPS has introduced a series 
of RISC boards and systems rooted in the 
Stanford MIPS experience, but it uses 
wholly new designs. The company has 
announced parts with speeds up to 16
megahertz. Both Patterson and Hennessy
see MIPS computers as the purest
commercial versions of RISC ideals.

The MIPS boards and chip sets are 
meant to be the building blocks for super- 
minicomputers. The CPU cards include 
3-, 5-, 8-, and 10-MIPS machines based 
on a custom 32-bit processor with 10- 
MIPS performance. This chip has 32 
general-purpose registers, instruction 
support for three external coprocessors, 
and hard-wired machine code. It doesn’t 
have hidden registers, condition codes, 
variable-length instructions, or multiple- 
address modes. 


The HP Spectrum 

MIPS and Hewlett-Packard made parallel 
efforts in studying instruction use in pro- 
gramming. HP has chosen an architec- 
ture that it calls “beyond RISC” to be the 
foundation for all its new-generation 
computers. The Spectrum systems are 
overdue, with a reported software problem
holding them up long enough to embarrass
HP, but they promise great speed
and compatibility with their popular
predecessor, the HP 3000 minicomputer.

That compatibility might have been a 
primary reason that HP chose to deviate 
from a strict RISC processor strategy. 
According to William Worley, the princi- 
pal architect of the HP Spectrum line, 
“We have to distinguish between archi- 
tecture and implementation. Some imple- 
mentations will realize all of the theoreti- 
cal efficiencies of RISC. Not all will. 
Each implementation needs to make 
sense as a business proposition.” 

HP made extensive studies of instruction
use and tried to keep the CPU to a
minimum configuration that would also 
provide compatibility with the previous 
generation. Joel Birnbaum, the man HP 
recruited to start its RISC effort, says 
that, “Whenever someone suggested we 
really ought to have a wonderful instruc- 
tion, like ‘test left, shift mask, dim the 
lights,’ we had to ask: ‘How often will we 
execute it, and what is the performance
degradation?’ ” HP wanted op code com- 
patibility along with peripheral subsys- 
tem, interrupt response, and I/O com- 
patibility. In fact, the company claims 
that the next step in the Spectrum line is 
RISC I/O, that is, direct attachment to the 
bus from any peripheral. 
Although the Spectrum chips will appear
in workstations, HP has given most
of the attention to their use in mini- 
computers. 


An ARM for AI 
ARM (Acorn RISC machine) is the name 
for a RISC chip developed by one of Britain’s
leading personal computer makers.
(Acorn is presently a subsidiary of Oli- 
vetti.) The company wanted a new pro- 
cessor for AI applications involved with 
Britain’s fifth-generation computing 
project, Alvey.

What Roger Wilson, Acorn’s senior 
software designer, wanted was a 32-bit 
microprocessor with some of the advan- 
tages of the 6502 from MOS Technology. 
Specifically, that meant good interrupt- 
handling capability. Wilson felt that 
many 16- and 32-bit chips lagged behind 
the 6502 in interrupt handling. In 1983 he 
began looking for new designs. 

The work on RISC caught his atten- 
tion, and over the next 18 months, a four- 
man design team from Acorn used soft- 
ware tools from VLSI Technology Inc., 
of San Jose, California, to structure a sin- 
gle 25,000-transistor chip that comprised 
a full 32-bit microprocessor: the ARM. 
(VLSI has nonexclusive marketing rights 
to the chip.) This chip is already available 
on evaluation boards and will eventually 
form the basis of a new family of Acorn 
products. 

The first samples of the ARM were 
complete in April 1985. They were fabricated
in 3-micron double-metal CMOS
and occupied 50 square millimeters of
silicon real estate. This is significantly
smaller than most microprocessors. For
instance, the 68020 puts 192,000 transistors
onto a chip of 80 mm². That difference
in size means better yields for ARM
production and therefore much lower
prices for the chips.

Acorn claims that the ARM has a fac- 
tor-of-4 advantage in that respect already 
with the 3-micron chip. According to 
Simon Woodward, the ARM systems 
product manager, Acorn plans to scale 
the design for fabrication in 2-micron 
double-metal CMOS, a process that 
should generate a 30-mm² chip.

At that size, the ARM might have as 
much as a factor-of-10 advantage in price 
over commercial 32-bit microprocessors. 
“We get better performance than a Mo- 
torola or Zilog with about one-tenth the 
number of components and a much 
smaller chip,” says Steve Furber, senior 
designer at Acorn’s business division. 

According to Acorn, the 2-micron ver- 
sion should be able to produce 3 MIPS. 
That’s about twice the CPU performance
of a VAX-11/780. Even at that pace, it
can work with comparatively cheap, slow
(4-MHz) RAM chips. Some other fast
microprocessors require expensive, high-speed
RAM to make a fast system.

The present ARM chips run at around 
4 MIPS, peaking at something over 6.6 
MIPS. They are primarily intended for 
use within microcomputers, but because 
of their extremely good interrupt re- 
sponses, Woodward suggests they would 
do well in real-time systems and industri- 
al controllers. The fast processing is not 
limited to LISP, as some press reports 
have claimed. ARM chips can support a 
wide range of high-level languages. How- 
ever, running LISP, they can outperform 
a Symbolics 3670 workstation, according 
to Acorn. 

Since the chips began to appear in 
1985, Acorn has been working on sup- 
port chips and evaluation systems based 
on the ARM. Acorn has designed a set of 
ARM-related controller chips that handle
memory, I/O, and video. The video chip 
also has a full sound system on it. The 
ARM can work without these chips, but 
they make a more powerful system if 
combined with it. One of the evaluation 
systems was on the market at press time, 
and the other was scheduled for early this 
year. Woodward says Acorn is “looking 
at a number of possibilities of distribution 
in the states,” but the firm “can handle
direct contact with U.S. companies and is 
doing so now.” 


Gallium Arsenide and RISC 
Gallium arsenide (GaAs) can be used in 
place of silicon for many integrated-cir- 
cuit purposes. It has several key advan- 
tages, including radiation hardness and 
less temperature sensitivity, but its key 
advantage to most computer makers is its 
speed. Because electrons travel faster in 
GaAs, circuits built on GaAs wafers can 
be faster than the same circuits on silicon. 
The main disadvantages of GaAs are 
that it is harder to work with than silicon 
and the chip makers have less experience 
with it. However, the state of the art in 
GaAs-chip technology has produced 
parts that have several thousand gates or 
memory cells. Although GaAs technol- 
ogy hasn’t yet reached far enough to pro- 
duce CISC processors, it can be used to 
implement a simple processor such as a 
RISC CPU, especially if that CPU is par- 
titioned into a chip set. 


However, the Department of Defense 
needs high-speed, radiation-hard proces- 
sors, so it decided in the spring of 1984 
that DARPA (Defense Advanced Re- 
search Projects Agency) would award 
contracts to several firms to design and 
make GaAs RISC processors. For high 
performance, DARPA wanted a single- 
chip microprocessor, and the only single- 
chip design that met all its needs was the 
MIPS chip developed under a DARPA 
grant at Stanford. Texas Instruments, 
RCA, and McDonnell-Douglas were 
chosen as the first-round contractors to 
make a single-chip, all-GaAs, MIPS 
microprocessor that could run at 200 
MHz. Because of the RISC emphasis on 
an instruction every cycle, that would
mean peak performance approaching 200 
MIPS. On the way to that goal, DARPA 
expected the contractors to make various 
GaAs chips that embodied part of the 
final design. 

Philip Congdon, manager of gallium 
arsenide systems and components at TI, 
says that a team from Control Data Corp. 
and TI has designed the CPU and is near- 
ly done designing the floating-point pro- 
cessor to accompany it. The chips have 
about 10,000 transistors in a bipolar tech- 
nology that uses only one transistor per 
gate.

Although the 200-MIPS figure is the 
peak performance, because of the 5-ns 
clock cycle, a more realistic figure would 
first subtract approximately 32 percent 
for NOPs in the pipeline and then de- 
crease what’s left by 32 percent for inade- 
quate bandwidth to memory. Even using 
GaAs memory chips, the system is ham-
pered by radio-frequency effects in the in-
terconnects between chips and is even re-
strained by the speed of light: Electrons
can travel only about five feet in a single
clock cycle. The sustainable performance 
rate of the TI chips will be about 92 
MIPS, about 10 times the capability of 
other fast RISC processors such as the 
MIPS or RISC II chips. Congdon expects 
to see the first chips no later than mid- 
1987. Figure 5 is a block diagram of the 
TI/CDC GaAs RISC system. The next 
chip in the series would be a memory 
management unit. 
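
The derating arithmetic behind the 92-MIPS estimate can be checked with a few lines of Python. This is only a sketch of the two successive 32 percent reductions quoted above; the function and constant names are illustrative and do not come from TI or CDC.

```python
# Figures quoted in the text; the names are illustrative assumptions.
PEAK_MIPS = 200.0     # one instruction per 5-ns clock cycle
NOP_LOSS = 0.32       # fraction lost to NOPs in the pipeline
MEMORY_LOSS = 0.32    # fraction of the remainder lost to memory bandwidth


def sustained_mips(peak, nop_loss, memory_loss):
    """Apply the two derating steps in the order the text describes."""
    after_nops = peak * (1.0 - nop_loss)
    return after_nops * (1.0 - memory_loss)


if __name__ == "__main__":
    print(round(sustained_mips(PEAK_MIPS, NOP_LOSS, MEMORY_LOSS), 1))
```

The result is about 92.5, which agrees with the sustainable rate of about 92 MIPS given for the TI chips.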

While CDC might employ these chips 
in a supercomputer, TI will be selling 
them on boards as computer-system dem- 
onstration units to the government. When 
such chips might hit the commercial mar- 
ket is an open question. 

Another computer that reportedly in- 
cludes GaAs parts is the upcoming Cray- 
3 supercomputer. Seymour Cray’s de- 
signs are often referred to as RISC-like 
because of his drive for simplicity and 
speed. 


High Native Instruction Rates Win 
The RISC idea has had a huge impact on 
computer architecture. Even designers 
who aren’t embracing it are borrowing 
from it. With a raft of new commercial 
pure and not-so-pure RISC products ap- 
pearing from the micro to the supermini 
level, there is no doubt that RISC has out- 
grown the stage of academic exercise. At 
the same time, universities continue to 
explore the RISC idea, particularly with 
processors aimed at symbolic processing. 
How well the eyebrow-raising instruc- 
tion rates of these new, simplified chips 
will translate into practical processing 
power and commercial success is not cer- 
tain. The next two or three years should 
provide some answers. Stanford’s Hen-
nessy believes that “the state of the art in
compiler technology has just started to 
improve dramatically in the last year.” 
He adds, “One of the real breakthroughs 
is progress in the register allocation 
area.” Interestingly, the original work 
was done at IBM on the 801 project. 
Stanford carried on the research and then 
a RISC team at DEC’s Western Research 
Lab did more, including work with new 
algorithms that reinforces the value of 
general-purpose registers. This should 
lead to even greater advantages for RISC
processors.

Referring to the flagship of Digital
Equipment’s minicomputer line as a
well-known standard, Patterson bluntly
states, “I think in the next few years a lot
of companies will come out with inex-
pensive RISC machines that will be faster
than DEC’s 8600.” He continues, “I’d
say, if you were going to design a brand
new instruction set today, you’d have to
be real stubborn not to employ some of
the RISC ideas.”





Figure 5: The TI/CDC GaAs RISC-system block diagram.

The Fairchild Clipper

A microprocessor that attempts to balance the best of CISC and RISC

Mike Ackerman and Gary Baum

THE FAIRCHILD CLIPPER processor
differs from other commercial 32-bit
microprocessors architecturally as well
as mechanically. Its features include a
balanced instruction set, high-bandwidth
dual buses, caching, hardware-managed
pipelining and resource allocation, con-
current processing units, and hardware-
based operating system support. More-
over, it can process up to 33 million
instructions per second.

The processor comes as a preas-
sembled module. Physically, it com-
prises a set of three CMOS VLSI chips
and a smaller CMOS clock generator,
which partition processing and memory
features to minimize interchip traffic.
These chips include the combined CPU/
floating-point unit; the two identical
cache/memory management units
(CAMMUs), one for data and the other
for instructions; and the clock generator
chip, which distributes the required clock
signal.


Balancing CISC and RISC
Clipper’s instruction set fosters fast-exe-
cuting compiled code from compilers
that optimize register use. Unnecessary
operations have been eliminated. The re-
maining operations are relatively simple
for globally optimizing compilers to
work with. These RISC-like instructions
are implemented in fast-acting hard-
wired logic; most frequently used in-
structions execute in one 30-nanosecond
clock cycle.

In accord with RISC philosophy, Clip-
per is essentially a load/store machine in
which all arithmetic and logical instruc-
tions operate only on data in registers;
only loads, stores, branches, calls, and 
stack manipulations access memory. The 
hardware architecture provides the 
needed registers: thirty-two 32-bitters, 
sixteen for the operating system and six- 
teen for user programs. To simplify and 
speed up decoding, all instructions are 
formatted as multiples of 16-bit parcels. 
The most frequently used instructions are 
shortest. 

The instruction set includes 101 hard- 
wired and 67 high-level macroinstruc- 
tions that operate on the basic data types. 
Each instruction specifies the operation 
to be performed, plus the type and loca- 
tion of its operands. These operands can 
reside in memory, in a register, or within 
the instruction itself. To speed decoding, 
all instructions contain from one to four 
16-bit parcels. Figure 1 details Clipper’s 
instruction formats. 

These instruction formats fall into two 
groups, those with addresses and those 
without. Instructions with addresses are 
those that must access memory, such as 
loads, stores, and branches. Instructions 
without addresses are the arithmetical 
and logical types and generally can exe- 
cute in one clock cycle. Although instruc- 
tions can have zero, one, or two oper- 
ands, only one operand can access a 
memory address. 

Clipper’s instruction set consists of 10 
functional categories. Load/store instruc- 
tions transfer addresses, bytes, half- 
words, words (32 bits), longwords, and 
floating-point quantities (single and dou- 
ble) between memory and registers. 
Move instructions transfer 32- and 64-bit
quantities between registers (integer and
floating point). 

Arithmetic instructions operate on reg- 
ister contents or intermediate values of 
variable length. These include add, sub- 
tract, multiply, divide, negate, modulus, 
and scale operations. Logical instructions 
operate on register contents. These in-
clude AND, OR, exclusive-OR, and
NOT operations. 

Shift/rotate instructions operate on 
words and longwords. Conversion in- 
structions can change single- or double- 
precision floating-point numbers into in- 
tegers rounded to IEEE specifications. 
Compare instructions test the value of 
words or floating-point numbers of either 
precision; an atomic test-and-set instruc- 
tion is also included. 

String instructions (compare, initial- 
ize, and move) manipulate character 
strings. Stack instructions manage pro- 
gram and system stack. These include 
push, pop, save multiple registers, and 
restore multiple registers. Control in- 
structions include branch, call, call su- 
pervisor, return, and NOP. 

Scattered throughout these 10 catego- 
ries are the 67 CISC-like macroinstruc- 
tions. For example, all conversion and 
string instructions are macros; except for 
push and pop, some stack operations are
macros; and some move, arithmetic, and
control instructions are also macros. Ex- 
cept for format, however, programmers 
should see no difference between macros 
and faster-acting hard-wired elemental 
instructions. In fact, each macroinstruc-
tion is implemented in the CPU’s macro-
instruction unit as a sequence of the hard-
wired instructions.


Additional CISC-related features in- 
clude a complete set of nine addressing 
modes for load/store instructions to fa- 
cilitate access to the complex data struc- 
ture elements (e.g., arrays, records, and 
arrays of records) of typical high-level 
languages. Clipper provides separate 
modes, with dedicated resources and 
unique privileges, for users and the oper-
ating system. Moreover, hardware sup-
port exists for key OS functions such as
system calls, exception handling, and vir-
tual memory.

Figure 1: Clipper’s balanced instruction set blends 101 RISC-like hard-wired streamlined instructions with 67 CISC-like macros.

Clipper provides nine memory-ad-
dressing modes (see figure 2) to specify a
unique virtual address as the sum of sev-
eral factors. With the relative mode and
the two relative-with-displacement
modes, the virtual address in question 
either is in a register or is to be computed 
as the sum of the values in a specified reg- 
ister and the displacement carried with 
the instruction itself. The two absolute 
modes carry within the instruction the 
virtual address as a pure displacement 
value. The two program-counter relative 
modes facilitate branching relative to the 
program counter’s current value. The 
two indexed modes sum the two specified 
register values to arrive at the virtual 
address. 

These addressing modes facilitate ac- 
cess to data structures, such as arrays and 
records, commonly used in high-level 
languages. Figure 3a maps how the rela-
tive-plus-displacement mode accesses an
array entry. Items in a two-dimensional
array are accessed via relative indexing in
figure 3b.

In the one-dimensional array, the dis-
placement value points to the array’s
base, while the register value defines the 
offset of the item at hand. Simply incre- 
menting the register by a fixed amount 
causes the same addressing mode to point 
to the next item in the array. Thus, a se- 
quence of items from an array can be 
quickly accessed using a loop with only 
one basic instruction and a fixed address-
ing mode.

In the two-dimensional array, one reg- 
ister value points to the first item in the 
selected row, while the second register 
defines the item’s offset within the row. 
Incrementing the first register yields the 
same offset in a new row, while incre- 
menting the second register points to the 
next item in the same row. Thus, the in- 
struction stays the same while entries in a 
whole set of two-dimensional arrays are 
accessed by simply incrementing regis- 
ters. Except for the initial instruction 
fetch, all operations can proceed on-chip. 
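
The two array-access patterns just described can be sketched in Python. This is a hypothetical model rather than Clipper code: memory is a dictionary keyed by byte address, the register file is a plain list, and the register numbers and function names are assumptions chosen only for illustration.

```python
MASK32 = 0xFFFFFFFF  # addresses are 32 bits wide


def relative_with_displacement(registers, reg, displacement):
    """Effective address = register value + displacement from the instruction."""
    return (registers[reg] + displacement) & MASK32


def relative_indexed(registers, base_reg, index_reg):
    """Effective address = sum of the two specified register values."""
    return (registers[base_reg] + registers[index_reg]) & MASK32


def sum_vector(memory, registers, base_address, count, word_size=4):
    """One-dimensional case: the displacement is the array base and
    register 1 (an arbitrary choice) holds the offset of the current item."""
    registers[1] = 0
    total = 0
    for _ in range(count):
        total += memory[relative_with_displacement(registers, 1, base_address)]
        registers[1] += word_size  # same mode now points to the next item
    return total


def read_row(memory, registers, row_address, count, word_size=4):
    """Two-dimensional case: register 2 points to the first item of the row
    and register 3 holds the offset within the row."""
    registers[2] = row_address
    registers[3] = 0
    items = []
    for _ in range(count):
        items.append(memory[relative_indexed(registers, 2, 3)])
        registers[3] += word_size  # next item in the same row
    return items
```

In both loops the addressing mode never changes; only a register is incremented, which is the property the text attributes to these modes.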




Clipper supports 10 distinct basic data 
types. These comprise both signed and 
unsigned versions of bytes, 16-bit half-
words, 32-bit words, and 64-bit long-
words; 32-bit single-precision and 64-bit
double-precision floating-point numbers 
that conform to the IEEE standard are 
also included primarily for technical or 
workstation applications. These primi- 
tive data types can serve as building 
blocks for the more complex structured 
data types, such as arrays and records. 


Exception Handling


Exceptions are those internal hardware 
conditions, external events, or even par- 
ticular instructions whose detection 
causes the system to suspend normal pro- 
cessor operation and in its place perform 
some special predetermined sequence of 
operations. 

Figure 2: Nine modes for addressing the virtual address space help systems and programmers deal efficiently with data structures.

Essentially, exceptions fall into three
categories: traps, interrupts, and super-
visor calls. Traps are anomalous internal
events that can occur while an instruction 
is being processed. Classic examples 
range from simple attempts to divide by 
zero to complexities such as a virtual 
memory system page fault. Interrupts are 
a means for external devices to signal to a
CPU that they need servicing. An exam- 
ple here could be a DMA controller sig- 
naling that it has finished transferring a 
block of data into memory. Supervisor 
calls are program-generated requests for 
services that the operating system 
provides. 

When one of these exceptions occurs, 
the appropriate software handler must be 
invoked—usually as soon as possible. 
This need for immediate action means 
that exceptions must suspend normal pro- 
cessing. When the handler finishes its 
task, control returns to the point at which 
the program halted. 

Unfortunately, the same pipelining that 
boosts system throughput complicates ex- 
ception handling. When an exception oc- 
curs, the pipeline must clear to allow for 
processing the exception handler as soon 
as the currently executing instruction is 
finished. When the exception handler is 
completed, normal processing resumes. 
Then the pipeline must refill and the in- 
struction pointer back up to refetch the 
next program instruction. Because Clip- 
per executes multiple instructions con- 
currently, simultaneous multiple excep- 
tions, such as a divide by zero that occurs 
in the same clock cycle as a page fault 
or floating-point fault, present added 
complications.

The architecture supports 18 traps, 256 
vectored interrupts, and 128 programma- 
ble supervisor calls. The traps handle 
page faults, attempts at violating memory 
protection, floating-point errors such as 
an overflow, arithmetic errors such as 
trying to divide by zero, and violation of a 
privileged instruction by a user-mode 
program. Any of these conditions causes 
the hardware to generate the appropriate 
trap. 

Clipper also handles priority-encoded 
interrupts. It encodes the interrupt type 
as one of the possible 256 on the byte- 
wide interrupt bus and invokes super- 
visor calls by executing a calls instruc- 
tion. Within the instruction, a parameter 
specifies the call type. 

Clipper handles all exceptions in much 
the same way. Initially, the current con- 
tents of the program counter (PC), the 
supervisor-status-word (SSW) register, 
and the program-status-word (PSW) reg- 
ister are saved on the supervisor stack. 

Next, a new SSW and PC are copied 
from the vector table. This is a data struc- 
ture that occupies the first real page of 
memory. The vector table contains the
address and SSW value for every excep-
tion-handler routine. Address/SSW pairs
are stored in vector table locations corre- 
sponding to their particular type of trap, 
interrupt, or call. The exception-han- 
dling software executes using the new 
SSW and PC values. After processing the 
exception, the handler routine executes a 
return-from-interrupt instruction. This 
restores the old PC, SSW, and PSW val- 
ues from the supervisor stack. Finally, 
the program picks up from where it had 
halted.
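
The save, vector, and restore sequence described above can be sketched as a small Python model. The class, the field names, and the representation of the vector table as a dictionary are assumptions made for illustration; only the order of operations comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Cpu:
    pc: int = 0
    ssw: int = 0
    psw: int = 0
    supervisor_stack: list = field(default_factory=list)


def enter_exception(cpu, vector_table, vector):
    """Save PC, SSW, and PSW, then load the handler's PC and SSW."""
    cpu.supervisor_stack.append((cpu.pc, cpu.ssw, cpu.psw))
    handler_pc, handler_ssw = vector_table[vector]  # address/SSW pair
    cpu.pc = handler_pc
    cpu.ssw = handler_ssw


def return_from_interrupt(cpu):
    """Restore the old PC, SSW, and PSW from the supervisor stack."""
    cpu.pc, cpu.ssw, cpu.psw = cpu.supervisor_stack.pop()
```

Because the saved state lives on a stack in this model, a second exception arriving inside a handler would nest naturally; whether and how Clipper nests exceptions is not stated here.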

Clipper’s load/store-type operation re- 
quires extensive hardware register sup- 
port. If most of the instructions are to op- 
erate on information in registers, the 
registers must be available. The register 
complement includes a 32-bit PC and 
thirty-two 32-bit general-purpose storage 
registers that can accommodate addresses 
or data words. General-purpose registers
save program steps when compared to ad-
dress- and data-dedicated registers by
eliminating unnecessary register-to-reg-
ister transfers when performing arithme-
tic operations on addresses. Clipper also
contains eight 64-bit registers for float-
ing-point arithmetic.

In support of multiuser operating sys- 
tems, Clipper has two operating modes: 
user and supervisor. These modes are 
distinguished by the registers they have 
access to and by the instructions each can 
use. Programs executing in supervisor 
mode (usually the operating system) have 
access to the data in all thirty-two gener- 
al-purpose and eight floating-point regis- 
ters. Access for user-mode programs is 
restricted to only sixteen of the general- 
purpose registers, called the user regis- 
ters, and to all the floating-point regis- 
ters. The sixteen registers inaccessible to 
user programs are supervisor registers. 


An additional sixteen 32-bit registers and
four 64-bit floating-point registers are
available to the macroinstruction unit and
are hidden from the user.

Figure 3: (a) Relative addressing provides a simple means for accessing any array item. (b) Relative indexing facilitates the more complex task of accessing items in two-dimensional arrays.

The CPU/FPU chip accommodates 
two 32-bit status words, the PSW and 
SSW. Both status words contain flag bits 
that identify and control the CPU. The 
PSW, which is accessible to both modes, 
contains the exception flags. The SSW, 
which is accessible only to supervisor- 
mode programs, contains the flags for in-
terrupts, address translation and protec-
tion, and modes of operation.

Each of the two CAMMUs in the pro- 
cessor module contains five software-ac- 
cessible registers for initialization and 
control. Two of these registers (page di- 
rectory origin registers, or PDOs) con- 
tain the base addresses of the supervisor 
and user page-table directories that are
used for memory-page translation. An-
other register (fault) contains a virtual ad-
dress pertaining to a particular fault con-
dition, so that the operating system can
use its contents in support of virtual
memory processes. The two remaining
registers (control and reset) help control
the CAMMUs. Figure 4 shows all the
register sets for user and supervisor
modes on a module.

Figure 4: The CPU/FPU chip and CAMMU register sets for user and supervisor modes.


Dual-Bus Bandwidth 

Clipper uses two buses between its 
CPU/FPU and the CAMMUs. Each bus 
is dedicated—one to data and the other to 
instruction traffic. The two-bus system 
effectively more than doubles the single- 
bus bandwidth by eliminating bus arbitra- 
tion. In addition to raising the bandwidth, 
the two buses and two CAMMUs in- 
crease the caching operation’s efficiency. 

Burst-mode transfers also enhance the 
bandwidth over the Clipper bus for infor- 
mation exchanges between the processor 
module and main memory. Through 
these, the module can send 16 data bytes 
(four 32-bit words) for each address 
word. Figure 5 shows the relationship of 
the module’s VLSI chips to the dual in- 
ternal and system buses. 

Figure 5 also shows the unique parti- 
tioning method for the chip set. Instead of 
an integer ALU alone, the CPU/FPU 
also contains a floating-point arithmetic 
processor that can perform more than 2 
million floating-point operations per sec- 
ond (MFLOPS). The result is faster float- 
ing-point operation than if the data and 
control signals had to pass off-chip to a 
floating-point coprocessor. Moreover,
the architecture allows floating-point and
integer operations to proceed simul- 
taneously. 

Functional partitioning of the cache 
and memory management functions is 
evident in figure 5. Dual CAMMU chips 
integrate all memory access functions 
onto dedicated chips for data and instruc- 
tions. Each CAMMU contains a 4K-byte 
cache plus its control logic and the man- 
agement logic to support demand-paged 
virtual memory with ample 4K-byte
pages.


Memory Caches 

Clipper’s closely integrated caches 
bridge the gap between its 30-ns CPU/ 
FPU and the 500-ns main-memory sys-
tems that are practical using 150-ns
DRAMs. The full hierarchy extends from
30 ns for CPU registers to over 30 milli-
seconds for data on disks. Bridging this
million-times gap in access times are the 
main memory and two separate caches, 
one within each CAMMU. One caches 
data, while the other caches instructions. 
Each cache is in reality a two-level mech- 
anism: The 4K-byte main caches each 
contain a quadword buffer that is, in ef- 
fect, a smaller high-speed virtual cache. 
Information within the main caches is 
organized into two sets of 128 lines each,
with a line holding a 16-byte quadword.
A cache access causes the entire line con-
taining the accessed item to be loaded
into the quadword buffer. Subsequent se-
quential accesses to information in the
same line do not require cache access. In- 
stead, the faster quadword buffer satisfies 
the request. 

A quadword buffer can be accessed in 
one 30-ns clock cycle. Upon a miss (the 
sought-for information not present in the 
quadword buffer), two additional 30-ns 
clock cycles are consumed to access the 
main cache and perform virtual address 
translation, making a total of 90 ns. 
Transferring information over the tightly 
coupled data or address buses in either 
direction between CPU and CAMMUs 
takes 15 ns. Thus, the total access time, 
including bus time for information in a 
quadword buffer, is 60 ns, and 120 ns 
from either of the main caches. 
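
The two totals follow from the cycle counts if one 15-ns bus transfer is charged in each direction. That reading is an assumption, adopted because it reproduces the 60-ns and 120-ns figures; the short Python check below uses illustrative names.

```python
CLOCK_NS = 30          # one CPU/CAMMU clock cycle
BUS_TRANSFER_NS = 15   # one transfer between CPU and CAMMU


def access_time_ns(cycles_in_cammu):
    """Cycles spent in the CAMMU plus one bus transfer each way (assumed)."""
    return cycles_in_cammu * CLOCK_NS + 2 * BUS_TRANSFER_NS


QUADWORD_BUFFER_NS = access_time_ns(1)  # hit in the quadword buffer
MAIN_CACHE_NS = access_time_ns(3)       # buffer miss, hit in the main cache
```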

Obviously, a cache improves the CPU/ 
memory access time only when the 
sought-after information is in the cache. 
And, of course, sometimes it is not there 
and a cache miss occurs. Then the miss- 
replacement time comes into play. Clip-
per uses a burst-mode technique to trans-
fer a 16-byte line from main memory over
the system bus to the cache, typically in
half as many clock cycles as conventional
microprocessors (figure 6).

Figure 5: Twin high-bandwidth buses connect data and instruction CAMMUs to the CPU.

If the cache is full when a miss occurs, 
the newly fetched line must overwrite one 
of the lines already in the cache. Set asso- 
ciativity of the cache organization deter- 
mines the cache’s flexibility in deciding 
which line to overwrite. Clipper’s two- 
way set-associative caches divide each 
4K-byte total storage into two compart- 
ments of 2K bytes each. Thus, informa- 
tion from any main-memory location can 
be written to one of two locations in the 
appropriate cache. This flexibility makes 
it less likely that potentially useful infor- 
mation will be overwritten in response to 
a cache miss. 
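
A two-way set-associative lookup of this kind can be modeled in a few lines of Python. The sketch below uses the sizes given in the text (two compartments of 128 lines, 16 bytes per line). The replacement rule is an assumption: it evicts the less recently used of the two candidate lines, since the text does not say which rule the CAMMU applies. All names are illustrative.

```python
LINE_BYTES = 16
LINES_PER_COMPARTMENT = 128
COMPARTMENTS = 2


class TwoWayCache:
    def __init__(self):
        # For each line position, the tags currently held, oldest first.
        self.positions = [[] for _ in range(LINES_PER_COMPARTMENT)]

    @staticmethod
    def split(address):
        """Return (tag, line position) for a byte address."""
        line_number = address // LINE_BYTES
        return (line_number // LINES_PER_COMPARTMENT,
                line_number % LINES_PER_COMPARTMENT)

    def access(self, address):
        """Return True on a hit; on a miss, load the line and return False."""
        tag, position = self.split(address)
        held = self.positions[position]
        if tag in held:
            held.remove(tag)
            held.append(tag)      # mark as most recently used
            return True
        if len(held) == COMPARTMENTS:
            held.pop(0)           # assumed policy: evict least recently used
        held.append(tag)
        return False
```

Two addresses that are 2K bytes apart map to the same line position, yet both can stay resident, one in each compartment. A direct-mapped cache of the same size would have to evict one to hold the other.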

The effectiveness of caches, in terms of 
hit rates, depends on the line size and de- 
gree of set associativity as well as the 
cache size. The curves in figure 7 indi- 
cate that Clipper’s 8K-byte total cache 
size along with its 16-byte lines and two- 
way set associativity deliver the same 90 
percent hit rate as a 128K-byte, direct- 
mapped cache with 4-byte lines, but with 
less than 10 percent of the hardware. 




With prefetching, the instruction cache’s 
hit rate can exceed 96 percent. 

Prefetching brings the next 16 bytes of 
memory into the instruction cache, in an- 
ticipation of a CPU request. Because pre- 
fetching happens concurrently with other 
CPU and CAMMU operations, this 
mechanism can deliver a 100 percent hit 
rate for bursts of in-line code sequences. 

Hit-rate and miss-replacement con- 
cepts are relatively straightforward for 
read accesses. The need to update main 
memory makes write accesses somewhat 
more complex. Clipper supports the two 
prevalent mainframe-type caching strate- 
gies—write through and copy back. 

Under a write-through strategy, main 
memory is updated each time the cache is 
altered. Hence, main memory and the 
caches always contain the same data, en- 
suring consistency. Unfortunately, write 
through doubles the cache access time 
and consumes main-memory bus band- 
width. 

Clipper also supports a copy-back 
strategy. Here, memory is updated only 
when a line that has been modified in the 
cache must be overwritten. Only then is 
the line copied back to main memory be- 
fore being overwritten. Data consistency 
during copy-back caching is assured by a 
CAMMU’s bus-watch hardware. This 
guards against stale data by fulfilling bus 
master-read requests from the cache in-
stead of from main memory. CAMMU
control-register bits and bus-cycle type
manage the bus-watch operation.

Clipper’s caches are closely tied to
their on-chip MMUs and their translation
lookaside buffers (TLBs). In brief, the
CPU generates 32-bit virtual addresses 
that the MMU/TLB translates into real 
addresses. The caching mechanism com- 
pares these real addresses with addresses 
stored in the cache. Upon a match, the 
word associated with the internal address 
is returned to the CPU. 


Pipelining 

Pipelining in the CPU has three phases. 
Parallel and concurrent operations take 
place in each phase. In order of occur- 
rence, the three phases are fetch, decode, 
and execute (see figure 8). The execute 
phase supports more concurrent opera- 
tions than either of the others, in essence 
another level of pipelining. 

In the first phase, instructions from the 
cache or from the macroinstruction unit 
are brought into the CPU’s instruction 
buffer. This buffer holds two words, or 
up to four instructions. 

Next, instructions are decoded into re- 
source requests. In response to these re- 
quests, resource management logic 
makes allocations using its table of busy 
resources. This resource scoreboard 
keeps tabs on the status of currently exe- 
cuting instructions and on which of these 
are using particular resources. This de-
tailed tracking lets the CPU restart in-
structions that have caused page faults
and continue executing instructions after 
interrupts and traps. Therefore (unlike 
software-managed pipelines), pro- 
grammed instructions, interrupts, and 
traps do not crash the pipeline.
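
The idea of a table of busy resources can be illustrated with a minimal Python sketch. It is not a description of Clipper's actual logic: the resource names, the instruction labels, and the all-or-nothing issue rule are assumptions made for the example.

```python
class Scoreboard:
    def __init__(self):
        self.busy = {}  # resource name -> instruction currently using it

    def can_issue(self, resources):
        """An instruction may issue only if none of its resources is busy."""
        return all(r not in self.busy for r in resources)

    def issue(self, instruction, resources):
        """Claim every requested resource, or claim none and report failure."""
        if not self.can_issue(resources):
            return False
        for r in resources:
            self.busy[r] = instruction
        return True

    def complete(self, instruction):
        """Release everything the finished instruction was holding."""
        held = [r for r, owner in self.busy.items() if owner == instruction]
        for r in held:
            del self.busy[r]
```

Because the table records which instruction owns each resource, it also carries the information needed to identify and restart an instruction after a fault, which is the benefit the text describes.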

Figure 6: Burst transfers typically halve cache-replacement time over that consumed with conventional microprocessor memory-to-cache coupling.

The pipeline’s final phase issues in-
structions for execution in either the
CPU’s three-stage integer-execution unit 
or its FPU. In this phase, up to four 
successive instructions (three integer and 
one floating point) can execute simulta- 
neously and are often overlapped. 

The first (L) integer-execution stage
reads into the L registers operands from
the general register file. Immediate oper- 
ands move directly from the instruction 
buffer to the L registers via the J register. 

The second (A) stage performs arith- 
metic, logical, and shift operations on L 
register operands or on the previous op- 
eration’s intermediate results. Results are 
stored in the A register. 

The third and final (O) stage sends the 
A register contents to the FPU, to the 
general register file for storage, via the by-
pass loop as feedback to the A stage, or to
the data CAMMU. The bypass loop im- 
mediately feeds back to the next instruc- 
tion intermediate results of multi-instruc- 
tion calculations. The bypass loop’s 
feedback action renders pipeline flush- 
ing, and its consequent program compli- 
cations and performance degradation, 
unnecessary. Figure 9 diagrams the inter- 
action among the major functional blocks 
in the CPU/FPU. 

The load/store architecture lets only 
arithmetic and logical instructions oper- 
ate on registers. This eliminates the need 
for an address pipeline or even sepa- 
rate calculation phases for effective ad- 
dresses. Instructions requiring an effec- 
tive-address calculation simply make one 
pass through the integer-execution unit’s
ALU, which in these cases has no arith-
metic or logical function to perform and
hence is free for computing the address.

Figure 7: Cache hit rates increase with the line size, associativity, and cache size in rather complex relationships. Two-way set-associative units with a 16-byte line average 90 percent hit rates at only 8K bytes; direct-mapped caches with 4-byte lines require 128K bytes to approach the same 90 percent.

Figure 8: Pipelining relies on hardware-resource and feedback management to oversee operations in its three phases. The throughput is enhanced by overlapping the phases and by simultaneous operations within each phase.

The CPU is essentially a three-stage machine, with three phases of integer execution and a separate but concurrent


Memory Management 

Clipper’s virtual address space is a 
straightforward, nonsegmented linear 4-
gigabyte (2^32-byte) space. Clipper creates
three separate real address spaces via a 3- 
bit system tag. These spaces contain main 
memory, boot ROM, and I/O, which is 
memory-mapped. Also, internal operat- 
ing modes create up to four instantaneous 
virtual address spaces. 

For multiprogramming (multitasking 
or multiuser operation), Clipper provides 
1 million 4K-byte pages of virtual address 
space—corresponding to its 32-bit ad- 
dressing hardware. Similarly, the real ad- 
dress space (defined by the amount of 
physical DRAM-based memory actually 
in place) is divided into 4K-byte page 
frames. 


The 4K-byte size has distinct advan-
tages over smaller pages: higher TLB hit
rate; faster I/O transfers, because 4K 
bytes is an efficient unit for disk transfers; 
and access of larger physical-address 
cache concurrently with address transla- 
tion. Most important, it allows translat- 
ing virtual to real memory addresses 
(mapping) with only two-level page 
tables (see figure 10). 

Here, each process owns a unique col- 
lection of page tables that contains its 
map and thereby defines its address 
space. The base level is a one-page table 
directory containing 1024 entries. Each 
entry pinpoints a unique page table. In 
turn, page tables are also one page long 
and contain page pointers. 
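
The two-level lookup can be sketched in Python. The 10/10/12-bit split of the 32-bit virtual address is an inference from the figures above (a 1024-entry directory, 4K-byte pages, and 1 million pages in all) and is not a quoted specification. The dictionaries stand in for the one-page tables, and all names are illustrative.

```python
PAGE_BITS = 12    # 4K-byte pages
INDEX_BITS = 10   # 1024 entries per directory (and, by inference, per table)


def split_virtual_address(va):
    """Return (directory index, page-table index, offset within the page)."""
    offset = va & ((1 << PAGE_BITS) - 1)
    table_index = (va >> PAGE_BITS) & ((1 << INDEX_BITS) - 1)
    directory_index = va >> (PAGE_BITS + INDEX_BITS)
    return directory_index, table_index, offset


def translate(directory, va):
    """Walk directory -> page table -> page frame and form the real address.

    `directory` maps a directory index to a page table, and each page table
    maps a table index to a page-frame number. A missing entry raises
    KeyError, standing in for a page fault in this sketch.
    """
    directory_index, table_index, offset = split_virtual_address(va)
    frame = directory[directory_index][table_index]
    return (frame << PAGE_BITS) | offset
```

For example, virtual address 0x00403ABC splits into directory entry 1, page-table entry 3, and offset 0xABC.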

Figure 10: Virtual-to-real address translation proceeds quickly under the two-level system that moves from a 1024-entry page-table directory to one-page tables to the pages themselves.

Finally, each CAMMU contains two
page directory origin registers. One
points to the PDO base for the super-
visor-mode program (operating system);
the other is for the currently executing
process. For context switching, upon a
process swap, the operating system sim- 
ply changes the user PDO, and the new 
user has a unique address space. The data 
and instruction CAMMUs, each with two 
PDOs, generate four memory maps, 
which in turn create four possible simul- 
taneously active address spaces: super- 
visor and user spaces for instructions and 
data alike. 

To save overhead on page-table look- 
ups, the CAMMUs cache address trans- 
lations of 128 frequently used pages in 
TLBs. Concurrently with cache access in 
its CAMMU, the TLB is searched. 
Hence, accesses to a directory in memory 
or a page table are made only upon TLB 
misses. 

Processes can share pages by simply
putting entries for the same real-page
frame in each process's page table. The
supervisor can access user pages similarly.
Also, the supervisor can use a user
PDO for its operand addresses and
thereby gain fast access to the entire user
address space.


Demand Paging 

Clipper’s architecture includes four key
features in support of demand-paged virtual
memory. A fault bit in each page-table
entry tells the CAMMU whether the
page is present in main memory. Also,
CAMMUs can activate a dedicated
page-fault trap upon attempting to access an
absent page. And, in the face of a page
fault, the instruction being attempted can
be aborted and reexecuted or resumed
after the OS has loaded the missing page.

Finally, referenced and dirty bits in the 
page-table entries help choose the best 
candidate for a newly swapped-in page to 
replace. The R bit indicates to the OS 
how recently its page has been used. 
Thereby, the OS has grist for a page- 
replacement algorithm based on usage. 
The D bit indicates whether or not a 
main-memory page has been modified. If 
it has, upon replacement it must be writ- 
ten back onto the disk. If it has not, the 
OS can discard it. 
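
One way an operating system might use the two bits is sketched below. The article says only that the bits are available to the OS; the ordering used here (unreferenced pages first, clean pages before dirty ones) is a simple not-recently-used policy chosen for illustration, and the `PageTableEntry` layout and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageTableEntry:
    frame: int
    referenced: bool  # R bit: the page has been used recently
    dirty: bool       # D bit: the page was modified while in main memory


def choose_victim(entries):
    """Pick the cheapest page to replace: unreferenced first, then clean before dirty."""
    return min(entries, key=lambda e: (e.referenced, e.dirty))


def evict(entry, write_back):
    """Write the page to disk only if it is dirty; return True if a write occurred."""
    if entry.dirty:
        write_back(entry.frame)
        return True
    return False


entries = [
    PageTableEntry(frame=0, referenced=True, dirty=False),
    PageTableEntry(frame=1, referenced=False, dirty=True),
    PageTableEntry(frame=2, referenced=False, dirty=False),
]
written = []                      # records frames that had to be written back
victim = choose_victim(entries)
wrote = evict(victim, written.append)
```

Here the unreferenced, clean page is chosen and is discarded without any disk write.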

Notwithstanding the new features Clip- 
per brings to microprocessors, software 
compatibility is proving to be little prob- 
lem. In our view, the preponderance of 
software for 32-bit microprocessors is be- 
ing written in high-level languages. The 
CLIX operating system, derived from the 
UNIX System V Release 3.0 operating 
system, and optimizing compilers for 
popular programming languages promise 
a relatively simple port of most existing 
and future programs. For developing proprietary
applications, Clipper comes
with a complete set of software-development
utilities, including interactive
debuggers and simulators.


Classic design methods converge
in the MC68030 microprocessor 


COMPARISONS OF THE relative archi- 
tectural merits of the reduced instruction 
set computer and complex instruction set 
computer methods might prove to be one 
of the more interesting computer science 
debates of the late 1980s. However, these 
two seemingly disparate views of the cor- 
rect way to build microprocessors might 
not be as far apart as they seem. This arti- 
cle examines the Motorola 68030 micro- 
processor with respect to the RISC-like 
features in this classic CISC machine. 

Before delving into the MC68030’s in- 
nards, I will encapsulate the RISC and 
CISC strategies. The term RISC is some- 
what of a misnomer. The acronym RISC 
has two commonly accepted meanings. 
The older meaning is reduced instruction 
set computer, and the newer is reusable 
instruction set computer. Both names 
imply that RISC has something to do with 
optimizing the microprocessor’s instruc- 
tion set. While this is true, it is also mis- 
leading, since RISC is much more than 
simply an architecture that necessitates a 
smaller or more efficient instruction set. 
Likewise, CISC is more than just an ar- 
chitecture that embodies complex or 
high-semantic-content instructions. It is 
much more reasonable to label RISC and 
CISC as implementation methodologies 
than as architectural constraints. 


The Tenets of RISC 

The major tenet of RISC is the investiga- 
tion of the assignment of system function- 
ality within an architecture. RISC strate- 
gies normally lead to the offloading of the 
more complex or infrequently used instructions
onto the compiler. The instructions
and addressing modes that are
left on the RISC processor are those fre- 
quently used by code generators em- 
bedded in compilers, those most advanta- 
geous to a language, and those that are 
vastly more efficient if implemented in 
hardware. Overall, the following RISC 
implementation features lead to improved 
performance: 


Single-cycle operation for every in- 
struction—In order to operate in a single 
cycle, an instruction must be either rela- 
tively simple or backed up by additional 
hardware logic. Whether simple or not, 
single-cycle instructions yield rates of 
many millions of instructions per second. 
High MIPS rates by themselves do not di- 
rectly indicate the amount of work ac- 
complished but only how fast the engine 
is running to accomplish the work. A 
good analogy is a car engine’s revolutions 
per minute versus the same car’s miles 
per hour. The lower the gear, the higher 
the rpm for a given mph. The rpm by it- 
self will not let you determine how long it 
will take to travel a distance, only how 
hard the engine will work during that 
time. 

Load/store design—This point dictates 
that only load and store operations should 
reference external memory. This tenet 
lets all other implemented instructions 
follow the criterion of single-cycle opera- 
tion since they will then have to operate 
only on on-chip registers (memory refer- 
ences can be indeterminate in length of 
time due to normal memory-access delays
such as refresh and direct-memory-access
controllers).

Hard-wired control—Microcoded architectures
have variable-length instruction
size and execution time. In a CISC
machine, microcode is a highly desirable 
trait because it lets the designer imple- 
ment many flexible, complex (high se- 
mantic content) instructions and address- 
ing modes in minimal silicon real estate. 
In a RISC machine, however, microcode 
is less desirable. Microcode doesn’t lend 
itself to single-cycle operations as direct- 
ly as dedicated hardware logic, since the 
microprocessor’s hardware has to 
dynamically interpret microcode. 

Relatively few instructions and ad- 
dressing modes—Adherence to this point 
facilitates the implementation of both 
single-cycle operation and hard-wired 
control with a relatively small investment 
in design time and silicon real estate. 
More easily decoded instructions plus 
simpler addressing modes can yield fast- 
er execution. 

Fixed instruction format—This tenet, 
once again, simplifies the design of the 
control circuitry. Less complex circuitry 
can normally run faster overall. 

More compile-time efforts—This crite- 
rion states that much of the static run- 
time complexity can and should be han- 
dled prior to run time by an optimizing 
compiler. An example of this would be 
the generation of an intermediate language
by all language compilers, which
in turn is compiled by an intermediate
optimizing compiler into object code (i.e.,
common pseudocode rear-ends on all
compilers). This software technology can
also be used to great advantage on CISC
machines, and this capability is just now
coming into vogue.

Minimal pipelining—Pipelining in a
CISC machine allows more efficient use
of the available bus bandwidth and lets it
produce performance equivalent to a
RISC architecture.

Thomas L. Johnson is manager of internal
technical communications at Motorola
Inc., OE33, 6501 William Cannon
Dr. West, Austin, TX 78735-8598.

APRIL 1987 * BYTE 153
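
The caution in the single-cycle item, that a MIPS rating tells you how fast the engine turns rather than how soon the trip ends, can be made concrete with a little arithmetic. The instruction counts and MIPS ratings below are invented purely for illustration and describe no real processor.

```python
def run_time_seconds(instruction_count, mips):
    """Time to finish a program: instructions divided by instructions per second."""
    return instruction_count / (mips * 1_000_000)


# Hypothetical figures for the same task compiled for two machines:
# many simple instructions at a high MIPS rate versus fewer, more
# capable instructions at a lower MIPS rate.
simple_machine = run_time_seconds(instruction_count=12_000_000, mips=8.0)
complex_machine = run_time_seconds(instruction_count=5_000_000, mips=4.0)
```

With these made-up numbers the machine with half the MIPS rating finishes first (1.25 seconds against 1.5), because it needs fewer instructions to do the same work.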
... do not attempt to limit the programmer/compiler
to load/store architectures.
However, they do incorporate many high-level
constructs to assist high-level language-compiler
writers (features such as
simple stacking primitives for procedure
calls/returns). Due to the circuit complexity
that results from CISC implementations,
much more time is normally involved
in the design/debugging of the
processor and much more care must be
taken to ensure proper operation at high
clock rates.

Overall, the trade-offs between the traditional
CISC and RISC implementation
philosophies are normally ones of circuit
complexity and assignment of system features
between the software and hardware
elements.

A common problem incurred in both
RISC and CISC microprocessor design is
the von Neumann bottleneck, where
microprocessors process information
faster than the memory system can supply
it. This problem has several solutions.
The most traditional approach is for the
system designer to implement some sort
of cache to act as a buffer between the
main memory and the microprocessor.
This cache can take several forms, and I
will not attempt to discuss the relative
merits of the various cache designs here.
Suffice it to say that a cache memory is a
fast-access (relative to the main-memory
system) local memory where copies are
kept of the most recently accessed main-memory
locations, along with some
bookkeeping data.

Another recent approach to limiting the
effects of the von Neumann bottleneck is
alternate architectures. The most widely
known of these, the Harvard architecture,
presents separate paths to the memory
system for instructions and data. This
technique almost doubles the available
bus bandwidth, allowing the processor to
wait on the memory subsystem less often
and allowing other attached processors
(i.e., DMA processors) to affect the main
processor less. Figures 1 and 2 show
these two techniques. Designers can use
cache memories, Harvard architectures,
and other bandwidth-saving techniques to
great advantage on both CISC and RISC
processor implementations.

Figure 2: The Harvard-architecture method of linking a processor to memory.


The MC68030

The Motorola 68030 builds upon the architecture
born in 1979 with the release
of the MC68000. For a complete discussion
of the M68000 series of processors,
see my article “A Comparison of
MC68000 Family Processors” in the
September 1986 BYTE.

While each of the M68000 micro- 
processors is based on the same CISC ar- 
chitecture, each has one very RISC-like 
feature: a large, undedicated, full-width 
register complement. The supervisory- 
and user-level programming models for 
these processors are shown in figures 3
and 4.

Figure 3: The user programming model’s registers in the M68000 series.





In addition to the register set, the 
M68000 series has specific hardware de- 
signed to make the processor execute in- 
structions as efficiently as possible. This 
hardware includes full-width internal 32- 
bit data and address buses regardless of 
the size of the external paths, separate 
ALUs for addresses or data that allow si- 
multaneous address and data calcula- 
tions, 3-byte instruction pipelines for the 
MC68000/008/010, and a 3-word in- 
struction pipe for the MC68020. The 
MC68020 also includes full on-chip sup- 
port for the coprocessor interface to allow 
the attachment of closely coupled co- 
processors (such as the MC68881 or 
MC68882 floating-point coprocessors or 
the MC68851 paged memory manage- 
ment unit), a 256-byte on-chip instruc- 
tion cache memory, and a 32-bit bus data 
buffer that acts as a prestaging area for 
the instruction pipeline and a holding 
area for data transfers. 


RISC-like Features in the MC68030 
The MC68030 maintains full upward 
object-code compatibility for user-level 
programs and is still a full virtual proces- 
sor, capable of both virtual memory and 
virtual machine operation. Note that the 
MC68030 maintains the same program- 
ming models as the MC68020 and adds 
functionality to the supervisory-level reg- 
isters (see figure 5). The MC68030 has 
all the features and functionality of the 
MC68020. 

Many of the MC68030’s features are 
those you might expect to find only in 
RISC machines. This reinforces the concept
that RISC is not an architecture but
rather an implementation method that can
be applied equally well to CISC processors.
As I look at some of the MC68030 ...

First, although the MC68030 incorporates
a two-cycle execution unit (EU) rather
than the single-cycle EU found in some
RISC processors, the time required for an
instruction to execute can be as little as
zero clock cycles. This is due to the overlapping
nature of internal/external bus
activity and the autonomy of internal processor
resources.

Working in conjunction with the two-cycle
EU is the unique two-level microcode
structure, which is perhaps the
MC68030’s single most non-RISC feature.
The initial instruction decode generates
a call into the first level of microcode.
Here, specific nanocode words are
called to generate the proper control signals
for instruction execution.

Due to the methods employed, adding
new instructions, or modifying the way
in which current instructions execute, is
simply a matter of modifying the microcode;
you don’t need to modify the execution
hardware. Once you have verified the
execution hardware, you can verify all instructions
by verifying the contents of the
microstores. Simply put, if you were to
slice the MC68030 horizontally across
the die, you could consider the upper half ...

Figure 4: The additional registers in the supervisory programming model.

[Figure residue; legible register labels: vector base register, CPU root pointer, supervisor root pointer, transparent translation 0, transparent translation 1.]

Figure 6: A functional block diagram of the MC68030 microprocessor.

The MC68030 also features a 256-byte
data cache to complement the 256-byte
instruction cache. The caches are arranged
as 16 lines of four longwords (32-bit
values) each, with each longword separately
accessible. Whereas the instruction
cache is read-only, the data cache
has a user-selectable write-allocate
policy to help prevent the stale-data phenomenon
that occurs when data is written
to the cache and not to main memory.
Due to a combination of the write-allocate
policy and a cache-content-freezing
mechanism, you can treat this cache like a
64-entry by 32-bit extension to the normal
eight data registers. This means that the
on-chip complement of registers can appear
to be a total of 80 registers in either
the user or supervisory programming
model. This is more than many current
RISC microprocessors.
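
The cache arithmetic behind these figures is easy to check. The line and longword counts come from the text; reading the total of 80 as the 64 cache entries plus all sixteen general registers (eight data and eight address) is an assumption about how the figure was reached.

```python
LINES = 16               # cache lines, per the text
LONGWORDS_PER_LINE = 4   # longwords in each line, per the text
BYTES_PER_LONGWORD = 4   # a longword is a 32-bit value

cache_entries = LINES * LONGWORDS_PER_LINE         # separately accessible longwords
cache_bytes = cache_entries * BYTES_PER_LONGWORD   # total capacity in bytes

# Assumption: the count of 80 includes the eight data registers (D0-D7)
# and the eight address registers (A0-A7) of the M68000 programming model.
general_registers = 8 + 8
apparent_registers = cache_entries + general_registers
```

The products work out to 64 entries and 256 bytes, and 64 plus 16 gives the 80 apparent registers quoted above.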

The two design goals for the 
MC68030’s caches were to reduce the 
processor’s external bus activity over that 
of the MC68020 and to increase effective 
CPU throughput, even though larger 
memory sizes or slower memories in- 
creased average access times. The 
throughput increase directly attributable
to the MC68030’s instruction and data
caches is derived by three basic means.
On-chip caches can be accessed in less
time than external memories, providing
improved access times for data residing in
the caches.

The burst-fill capability of the caches 
lets data be found in the caches even 
though they have never been accessed be- 
fore, lowering the average access times 
for data in the cache even further. In 
burst-fill operation, the MC68030 will 
always attempt to completely fill a cache 
line. To accomplish this, it might request 
a burst fill from external hardware during 
a data/instruction read. If the external 
hardware can operate in a burst mode 
for this access, it will respond to the 
MC68030 to indicate this fact. The 
MC68030 will then simply latch data on
the trailing edge of each successive clock 
until the cache line is filled. 


Harvard Architecture 

The structure of the instruction and data 
cache memories and the way in which 
they are incorporated into the overall 
microprocessor architecture make the 
MC68030 the first CPU to use a modified 
Harvard architecture internally on a sin- 
gle chip. The autonomous nature of the 
caches lets accesses to both caches and 
external accesses occur simultaneously 
with instruction execution. This parallel- 
ism of instruction execution, along with 
instruction and data accesses to both 
caches and the external world, is en- 
hanced to allow multiple instructions to 
execute concurrently internally along
with a single data access to the external 
world. 

The microprocessor has three separate 
internal 32-bit buses for data and instruc- 
tion movement. Consequently, there are 
separate paths to memory for both in- 
structions and data within the chip. Be- 
cause of these multiple buses, the execu- 
tion unit can access data from the data 
cache and instructions from the instruc- 
tion cache while simultaneously fetching 
an operand from the external world. Until 
now, this modified Harvard capability 
has been almost entirely a feature of 
RISC machines. 


Three-Stage Pipeline 

The MC68030 uses a three-stage in- 
struction pipeline much like that of the 
MC68020. The size of the pipeline is a 
trade-off between increasing the perfor- 
mance of the on-chip EU on interpreted 
microcode and the frequency of branches 
in normal code. A pipe that is too long 
must be repeatedly cleared and refilled as 
program branches take place. However, a 
microcoded processor without any type 
of instruction pipeline uses too much 
time in the sequential execution of in-line 
instructions. It is, therefore, helpful in 
microprogrammed architectures to let the 
processor work on the various phases of 
instruction execution simultaneously 
whenever possible without unduly wast- 
ing time on decoding instructions that 
will never be used due to branching. Simulation
studies show that a three-stage
pipeline is optimal for the M68000 
architecture. 

To further reduce the effects of the von 
Neumann bottleneck, the MC68030 can 
run three different types of external bus 
cycles on a cycle-by-cycle basis.
These three types are asynchronous bus
cycles (the same type of bus cycle run on
the original MC68000), synchronous
two-cycle bus cycles, and burst bus
cycles. The burst cycle, discussed previously,
works well with newer RAM
technologies such as static-page and
nibble-and-column-mode dynamic RAMs
and allows the transfer of up to four 32-bit
values in as little as five clock cycles. The 
burst-fill type of bus cycle is requested by 
the MC68030 whenever possible. How- 
ever, external logic is free to choose the 
type of bus cycle needed. 
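
To see what the burst cycle buys, compare clock counts for filling one four-longword cache line. The two-clock synchronous cycle and the five-clock best case for four longwords are from the text; splitting the burst into two clocks for the first longword and one for each of the rest is an assumption that happens to reproduce the five-clock figure.

```python
def synchronous_clocks(longwords, clocks_per_cycle=2):
    """Clocks to read the longwords with one two-clock bus cycle apiece."""
    return longwords * clocks_per_cycle


def burst_clocks(longwords, first_access=2, subsequent=1):
    """Clocks for one burst: assumed 2 clocks for the first longword, 1 for each after."""
    if longwords == 0:
        return 0
    return first_access + (longwords - 1) * subsequent


line_fill_synchronous = synchronous_clocks(4)
line_fill_burst = burst_clocks(4)
```

Under these assumptions a line fill takes eight clocks as four separate synchronous cycles and five clocks as a single burst.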


Figure 7: The location on the chip of the various MC68030 circuits.


On-Chip Hardware
for Instruction Support

One of the most basic concepts of RISC
architectures is that of hardware support
for instructions. The MC68020/
MC68030, although not RISC proces- 
sors, have an impressive amount of on- 
chip hardware for special instructions. 
This support includes a 32-bit barrel 
shifter that lets the processor shift or ro- 
tate a 32-bit value up to 32 bits in a single 
clock cycle. Additionally, all ALUs on 
the devices are a full 32 bits wide. Also 
assisting in overall execution is the 
MC68851 paging MMU, brought on- 
chip in the MC68030. This paging 
MMU, with on-chip translation descrip- 
tor cache, lets the MC68030 generate 
physical addresses for the external mem- 
ory subsystem with no additional delay 
(address translation occurs in parallel 
with other processor activities). 

Finally, one of the bastions of RISC is 
that due to the simpler nature of the hard- 
ware, it can obtain substantially higher 
clock rates. This is not without its prob- 
lems. To use these higher clock rates, ex- 
ternal memory must be made to respond 
without imposing so many wait states that 
the faster clock becomes meaningless. 
The MC68030 has a design frequency of 
20 megahertz. The original design fre- 
quency of the MC68000 was 8 MHz, and 
it is currently offered by Motorola in 
12.5-MHz frequencies. The design fre- 
quency of the MC68020 was 16.67 MHz 
and is currently offered in speeds to 25 
MHz. If past performance is any indica- 
tor, it is safe to assume that the MC68030 
will be offered in speeds substantially
higher than 25 MHz and with average performance
much greater than 5 MIPS.
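
The earlier point about wait states eroding a faster clock can be put in numbers. The two-clock bus cycle and the 20-MHz design frequency are from the text; the 10-MHz comparison system and the three wait states are hypothetical values picked for illustration.

```python
def access_time_ns(clock_mhz, wait_states, base_clocks=2):
    """Duration of one bus cycle in nanoseconds, including any wait states."""
    clock_period_ns = 1000.0 / clock_mhz
    return (base_clocks + wait_states) * clock_period_ns


fast_clock_fast_memory = access_time_ns(clock_mhz=20, wait_states=0)
fast_clock_slow_memory = access_time_ns(clock_mhz=20, wait_states=3)
slow_clock_fast_memory = access_time_ns(clock_mhz=10, wait_states=0)
```

In this sketch a 20-MHz part held to three wait states takes 250 ns per access, longer than the 200 ns of a 10-MHz part running with none, which is the sense in which the faster clock becomes meaningless.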


Conclusion 

It is a serious mistake to assume that the 
acronym RISC stands for higher perfor- 
mance than the acronym CISC. Instead, 
it is more accurate to say that RISC rep- 
resents a step forward in defining a set of 
methods that can be used to advantage in 
the implementation of any microproces- 
sor architecture. The RISC feature that I 
believe holds the most promise for the 
future is in the area of division of overall 
system responsibility between the micro- 
processor’s hardware and intelligently 
written high-level language compilers. It 
is important to remember that, although 
a genre of applications exists for which 
assembly language coding is essential 
due strictly to performance—especially 
real-time performance—the number of 
applications that fall within this genre is 
diminishing. In the end, regardless of 
whether an architecture is labeled RISC 
or CISC, it is up to the system implementors
to choose the architecture that most
directly addresses their concerns for the
highest, most cost-effective performance
and reusability of their current systems
and applications software.